Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.
In this lab, you will be working with the 2018 General Social Survey (GSS). The GSS is a sociological survey created and regularly collected since 1972 by the National Opinion Research Center at the University of Chicago. It is funded by the National Science Foundation. The GSS collects information and keeps a historical record of the concerns, experiences, attitudes, and practices of residents of the United States, and it is one of the most important data sources for the social sciences.
The data include features that measure concepts that are notoriously difficult to ask about directly, such as religion, racism, and sexism. The data also include many different metrics of how successful a person is in his or her profession, including income, socioeconomic status, and occupational prestige. These occupational prestige scores are coded separately by the GSS. The full description of their methodology for measuring prestige is available here: http://gss.norc.org/Documents/reports/methodological-reports/MR122%20Occupational%20Prestige.pdf Here's a quote to give you an idea about how these scores are calculated:
Respondents then were given small cards which each had a single occupational titles listed on it. Cards were in English or Spanish. They were given one card at a time in the preordained order. The interviewer then asked the respondent to "please put the card in the box at the top of the ladder if you think that occupation has the highest possible social standing. Put it in the box of the bottom of the ladder if you think it has the lowest possible social standing. If it belongs somewhere in between, just put it in the box that matches the social standing of the occupation."
The prestige scores are calculated from the aggregated rankings according to the method described above.
Import the following packages:
import numpy as np
import pandas as pd
import sidetable
import weighted # this is a module of wquantiles, so type pip install wquantiles or conda install wquantiles to get access to it
from scipy import stats
from sklearn import manifold
from sklearn import metrics
import prince
from pandas_profiling import ProfileReport
pd.options.display.max_columns = None
Then load the GSS data with the following code:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"])
Drop all columns except for the following:
* id - a numeric unique ID for each person who responded to the survey
* wtss - survey sample weights
* sex - male or female
* educ - years of formal education
* region - region of the country where the respondent lives
* age - age
* coninc - the respondent's personal annual income
* prestg10 - the respondent's occupational prestige score, as measured by the GSS using the methodology described above
* mapres10 - the respondent's mother's occupational prestige score, as measured by the GSS using the methodology described above
* papres10 - the respondent's father's occupational prestige score, as measured by the GSS using the methodology described above
* sei10 - an index measuring the respondent's socioeconomic status
* satjob - responses to "On the whole, how satisfied are you with the work you do?"
* fechld - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* fefam - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* fepol - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* fepresch - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* meovrwrk - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

Then rename any columns with names that are non-intuitive to you to more intuitive and descriptive ones. Finally, replace the "89 or older" values of age with 89, and convert age to a float data type. [1 point]
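# Keep only the columns needed for this lab
columns_to_keep = ['id', 'wtss', 'sex', 'educ', 'region', 'age', 'coninc', 'prestg10',
                   'mapres10', 'papres10', 'sei10', 'satjob', 'fechld', 'fefam',
                   'fepol', 'fepresch', 'meovrwrk']
# .copy() makes an independent DataFrame, so the assignments below
# do not trigger pandas' SettingWithCopyWarning
gss = gss[columns_to_keep].copy()
# Recode the top-coded age category to 89, then store age as a float
gss['age'] = gss['age'].replace("89 or older", 89)
gss['age'] = gss['age'].astype(float)
gss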
| | id | wtss | sex | educ | region | age | coninc | prestg10 | mapres10 | papres10 | sei10 | satjob | fechld | fefam | fepol | fepresch | meovrwrk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.357493 | male | 14.0 | new england | 43.0 | NaN | 47.0 | 31.0 | 45.0 | 65.3 | very satisfied | strongly agree | disagree | agree | strongly disagree | agree |
| 1 | 2 | 0.942997 | female | 10.0 | new england | 74.0 | 22782.5000 | 22.0 | 32.0 | 39.0 | 14.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | 0.942997 | male | 16.0 | new england | 42.0 | 112160.0000 | 61.0 | 32.0 | 72.0 | 83.4 | mod. satisfied | strongly agree | disagree | disagree | disagree | disagree |
| 3 | 4 | 0.942997 | female | 16.0 | new england | 63.0 | 158201.8412 | 59.0 | NaN | 39.0 | 69.3 | very satisfied | agree | disagree | disagree | disagree | neither agree nor disagree |
| 4 | 5 | 0.942997 | male | 18.0 | new england | 71.0 | 158201.8412 | 53.0 | 35.0 | 45.0 | 68.6 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2343 | 2344 | 0.471499 | female | 12.0 | new england | 37.0 | NaN | 47.0 | 31.0 | 72.0 | 38.8 | mod. satisfied | disagree | strongly disagree | disagree | strongly disagree | disagree |
| 2344 | 2345 | 0.942997 | female | 12.0 | new england | 75.0 | 22782.5000 | 28.0 | NaN | 27.0 | 21.6 | very satisfied | strongly agree | disagree | disagree | disagree | disagree |
| 2345 | 2346 | 0.942997 | female | 12.0 | new england | 67.0 | 70100.0000 | 40.0 | 45.0 | 53.0 | 41.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2346 | 2347 | 0.942997 | male | 16.0 | new england | 72.0 | 38555.0000 | 47.0 | 53.0 | 50.0 | 62.7 | NaN | disagree | agree | disagree | strongly agree | agree |
| 2347 | 2348 | 0.471499 | female | 12.0 | new england | 79.0 | NaN | 33.0 | NaN | 46.0 | 13.6 | very satisfied | strongly disagree | strongly agree | disagree | strongly agree | strongly agree |
2348 rows × 17 columns
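The renaming step requested above can be sketched as follows. This is only an illustration on a toy frame: the new names (`education_years`, `income`, `job_prestige`) are hypothetical choices, not names the lab prescribes, and the values are made up.

```python
import pandas as pd

# Toy frame that reuses a few of the original GSS column names (made-up values).
toy = pd.DataFrame({
    "educ": [14.0, 10.0],
    "coninc": [22782.5, 112160.0],
    "prestg10": [47.0, 22.0],
})

# Hypothetical, more descriptive names; pick whatever reads best to you.
new_names = {
    "educ": "education_years",
    "coninc": "income",
    "prestg10": "job_prestige",
}

# rename() returns a new DataFrame; columns not listed in the mapping are kept as-is.
toy = toy.rename(columns=new_names)
print(list(toy.columns))
```

The same pattern applies to the real data by calling `rename` on `gss` with a mapping that covers every column you want to change.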
profile = ProfileReport(gss, title="GSS 2018 Data Report", explorative=True)
profile.to_notebook_iframe()
Looking through the HTML report you displayed in part a, how many people in the data are from New England? [1 point]
According to the report, 124 respondents are from New England, which is 5.3% of the sample.
Looking through the HTML report you displayed in part a, which feature in the data has the highest number of missing values, and what percent of the values are missing for this feature? [1 point]
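One way to cross-check the report's missing-value figures directly in pandas is sketched below. The toy frame and its values are assumptions for illustration only; on the real data the same expression would be applied to `gss`.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately different amounts of missingness per column.
toy = pd.DataFrame({
    "age": [43.0, 74.0, 42.0, 63.0],
    "income": [np.nan, 22782.5, np.nan, np.nan],
    "satjob": ["very satisfied", None, "mod. satisfied", None],
})

# isna() marks missing cells; the column mean of those booleans is the
# share missing, which we express as a percent and sort from high to low.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The first entry of the sorted series is the feature with the most missing values, which should agree with the figure in the profile report.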
Looking through the HTML report you displayed in part a, which two distinct features in the data have the highest correlation? [1 point]
On a primetime show on a 24-hour cable news network, two unpleasant-looking men in suits sit across a table from each other, scowling. One says "This economy is failing the middle class. The average American today is making less than \$48,000 a year." The other screams "Fake news! The typical American makes more than \$55,000 a year!" Explain, using words and code, how the data can support both of their arguments. Use the sample weights to calculate descriptive statistics that are more representative of the American adult population as a whole. [1 point]
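Two people can quote different "typical" incomes from the same data when one uses the mean and the other the median, because a right-skewed distribution pulls the mean above the median. The sketch below shows weighted versions of both on made-up numbers. The helpers `weighted_mean` and `weighted_median` are hypothetical and use only NumPy; `weighted.median` from the wquantiles package imported above could be used instead, although it interpolates, so its result may differ slightly from this simple version.

```python
import numpy as np

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(values * weights) / np.sum(weights))

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return float(values[idx])

# Made-up incomes with one very high earner, plus made-up sample weights.
income = [20000, 30000, 40000, 50000, 300000]
wts = [1.0, 2.0, 1.0, 1.0, 1.0]

print(weighted_mean(income, wts))    # pulled upward by the 300000 value
print(weighted_median(income, wts))  # unaffected by how large the top value is
```

On the real data, rows with a missing income would need to be dropped (together with their weights) before either function is called.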
For each of the following parts,
Is there a gender wage gap? That is, is there a difference between the average incomes of men and women? [2 points]
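A difference between two group means is commonly tested with an independent-samples t-test. The sketch below runs Welch's version (which does not assume equal variances) on simulated data; the column names `sex` and `income` and all values are assumptions, and the test shown is unweighted.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated incomes for two groups; these are not GSS values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sex": ["male"] * 200 + ["female"] * 200,
    "income": np.concatenate([
        rng.normal(55000, 15000, size=200),
        rng.normal(50000, 15000, size=200),
    ]),
})

# Group means give the size of the gap; the test says how surprising it is.
group_means = toy.groupby("sex")["income"].mean()

# ttest_ind does not skip missing values by default, so drop them first.
men = toy.loc[toy["sex"] == "male", "income"].dropna()
women = toy.loc[toy["sex"] == "female", "income"].dropna()

# equal_var=False requests Welch's t-test.
result = stats.ttest_ind(men, women, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(group_means, t_stat, p_value)
```

The p-value here is the probability of seeing a difference in sample means at least this large if the two population means were equal.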
Are there different average values of occupational prestige for different levels of job satisfaction? [2 points]
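When the grouping feature has more than two categories, a one-way ANOVA is a common way to test whether all group means are equal. The sketch below uses made-up prestige scores; the column name `prestige` and the category labels are illustrative and may not match the exact labels in the GSS file.

```python
import pandas as pd
from scipy import stats

# Made-up prestige scores for three satisfaction levels (illustrative labels).
toy = pd.DataFrame({
    "satjob": ["very satisfied"] * 4 + ["mod. satisfied"] * 4 + ["dissatisfied"] * 4,
    "prestige": [55, 60, 52, 58, 45, 50, 47, 44, 35, 40, 38, 33],
})

# Mean prestige within each satisfaction level.
group_means = toy.groupby("satjob")["prestige"].mean()

# f_oneway takes one array of observations per group.
groups = [g["prestige"].dropna().to_numpy() for _, g in toy.groupby("satjob")]
f_stat, p_value = stats.f_oneway(*groups)
print(group_means, f_stat, p_value)
```

A small p-value indicates that at least one group mean differs from the others; it does not say which one.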
Report the Pearson's correlations between years of education, socioeconomic status, income, occupational prestige, and a person's mother's and father's occupational prestige. Then perform a hypothesis test for the correlation between years of education and socioeconomic status and provide a specific and accurate interpretation of the $p$-value associated with this hypothesis test beyond "significant or not". [2 points]
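A correlation matrix and a test of a single correlation can be sketched as follows. The data are simulated and the column names `education` and `socioeconomic_index` are assumptions; on the real data the matrix would include all six features named above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data in which the index rises with education, plus noise.
rng = np.random.default_rng(1)
educ = rng.integers(8, 21, size=300).astype(float)
sei = 4.0 * educ + rng.normal(0, 10, size=300)
toy = pd.DataFrame({"education": educ, "socioeconomic_index": sei})

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr_matrix = toy.corr()

# pearsonr cannot handle missing values, so keep only complete pairs.
pair = toy[["education", "socioeconomic_index"]].dropna()
r, p_value = stats.pearsonr(pair["education"], pair["socioeconomic_index"])
print(corr_matrix, r, p_value)
```

The p-value is the probability, if the true correlation in the population were zero, of drawing a sample whose correlation is at least as far from zero as the one observed.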
Create a new categorical feature for age groups, with categories for 18-35, 36-49, 50-69, and 70 and older (see the module 8 notebook for an example of how to do this).
Then create a cross-tabulation in which the rows represent age groups and the columns represent responses to the statement that "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family." Rearrange the columns so that they are in the following order: strongly agree, agree, disagree, strongly disagree. Place row percents in the cells of this table.
Finally, use a hypothesis test that can tell us whether there is enough evidence to conclude that these two features have a relationship, and provide a specific and accurate interpretation of the $p$-value. [2 points]
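The three steps above (binning age, building a row-percent cross-tabulation, and testing for a relationship) can be sketched on simulated data as follows. The responses here are drawn at random, so no real relationship is built in; the chi-square test of independence is one reasonable choice of test, not necessarily the one the module uses.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
order = ["strongly agree", "agree", "disagree", "strongly disagree"]

# Simulated ages and randomly drawn responses (not GSS values).
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=400).astype(float),
    "fefam": rng.choice(order, size=400),
})

# pd.cut bins are right-inclusive: (17, 35], (35, 49], (49, 69], (69, inf).
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[17, 35, 49, 69, np.inf],
    labels=["18-35", "36-49", "50-69", "70 and older"],
)

# normalize="index" divides each cell by its row total, giving row proportions.
row_pct = pd.crosstab(toy["age_group"], toy["fefam"], normalize="index")
row_pct = (row_pct * 100)[order].round(1)

# The chi-square test needs raw counts, not percents.
counts = pd.crosstab(toy["age_group"], toy["fefam"])[order]
chi2, p_value, dof, expected = stats.chi2_contingency(counts.to_numpy())
print(row_pct, chi2, p_value, dof)
```

The p-value is the probability of a chi-square statistic at least this large if age group and response were independent in the population.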
For this problem, you will conduct and interpret a correspondence analysis on the categorical features that ask respondents to state the extent to which they agree or disagree with the statements:
Conduct a correspondence analysis using the observed features listed above that measures two latent features. Plot the two latent features for each category in each of the features used in the analysis. [2 points]
Display the latent features for every category in the observed features, sorted by the first latent feature. Describe in words what concept this feature is attempting to measure, and give the feature a name. [2 points]
We can use the results of the MCA model to conduct some cool EDA. For one example, follow these steps:
Use the .row_coordinates() method to calculate values of the latent feature for every row in the data you passed to the MCA in part a. Extract the first column and store it in its own dataframe.
To join it with the full, cleaned GSS data based on row numbers (instead of on a primary key), use the .join() method. For example, if we named the cleaned GSS data gss_clean and if we named the dataframe in step 1 latentfeature, we can type
gss_clean = gss_clean.join(latentfeature, how="outer")
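How an index-based outer join behaves can be seen on a toy example. The values and the column name `latent1` are hypothetical, and the sketch assumes the latent-feature dataframe kept the original row index of the data passed to the MCA.

```python
import pandas as pd

# Toy stand-in for the cleaned GSS data.
gss_clean = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": [43.0, 74.0, 42.0],
})

# Hypothetical latent scores exist only for rows 0 and 2, for example
# because row 1 had missing responses and was left out of the MCA.
latentfeature = pd.DataFrame({"latent1": [0.35, -0.80]}, index=[0, 2])

# join() aligns on the row index; how="outer" keeps every row from both sides.
joined = gss_clean.join(latentfeature, how="outer")
print(joined)
```

Rows without a latent score are kept and receive a missing value in the new column, so no respondents are lost by the join.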
What does this table tell you about the relationship between sex, age, and the latent feature? [2 points]